Module 02 - Python Observability
What You Will Learn
By the end of this module you will be able to:
- Explain the three pillars of observability and why each one is irreplaceable
- Configure production-grade structured logging with structlog and correlation IDs
- Expose Prometheus metrics from a FastAPI service and write real PromQL queries
- Instrument a Python microservice with OpenTelemetry and visualise traces in Jaeger
- Capture, group, and alert on production exceptions with Sentry
- Build health check endpoints that Kubernetes actually trusts
Prerequisites
| Requirement | Why It Matters |
|---|---|
| Python 3.11+ | contextvars, asyncio, type hints used throughout |
| FastAPI basics | All production examples use FastAPI |
| Docker + docker-compose | Every tool in this module runs locally via compose |
| Module 1 complete | Async patterns and profiling context assumed |
| Basic SQL / PostgreSQL | Incident examples reference pg_stat_activity |
The Incident That Starts Every Observability Story
It is 14:23 on a Tuesday. Requests per second on your Python API are normal. HTTP 200s are flowing. No exceptions in Sentry. No alerts in PagerDuty. But your product manager has just forwarded a screenshot from a paying customer: every action in the app takes 8–12 seconds instead of the usual 400ms.
You open your logs:
INFO:uvicorn.access: 200 POST /api/documents 11432ms
INFO:uvicorn.access: 200 POST /api/documents 9871ms
INFO:uvicorn.access: 200 POST /api/documents 12103ms
The service is returning 200 OK. Latency is terrible. Logs show nothing useful - no query, no user, no context. You have no metrics so you cannot see when it started or whether it is getting worse. You have no traces so you cannot see where the 11 seconds are actually going.
Four hours later, after crawling through application code and guessing, someone runs this on the database host:
SELECT count(*), state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE datname = 'myapp_prod'
GROUP BY state, wait_event_type, wait_event
ORDER BY count DESC;
count | state | wait_event_type | wait_event
-------+--------+-----------------+------------
48 | active | Lock | relation
2 | idle | |
0 | ...
Connection pool exhaustion. The application pool was set to 10 connections. Under load it queued requests waiting for a free connection, each waiting up to the 30-second timeout. The service never returned an error because the requests eventually succeeded - just 11 seconds late.
Four hours of debugging for a problem that a single Prometheus gauge would have surfaced in four seconds.
This module is about never having that four-hour incident again.
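The gauge mentioned above could look roughly like the following. This is a minimal sketch, assuming the prometheus_client library; the metric names, the `tracked_connection` helper, and the pool name are hypothetical and only illustrate the idea of tracking checked-out connections.

```python
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Gauge

# A dedicated registry keeps this sketch isolated; a real service would
# normally register on the default registry that /metrics exposes.
registry = CollectorRegistry()

# Hypothetical metric names, chosen for illustration only.
pool_in_use = Gauge(
    "db_pool_connections_in_use",
    "Connections currently checked out of the pool",
    ["pool"],
    registry=registry,
)
pool_size = Gauge(
    "db_pool_size",
    "Configured maximum number of pooled connections",
    ["pool"],
    registry=registry,
)
pool_size.labels(pool="primary").set(10)


@contextmanager
def tracked_connection(pool_name: str):
    """Wrap a (hypothetical) pool checkout so the gauge follows pool usage."""
    pool_in_use.labels(pool=pool_name).inc()
    try:
        yield
    finally:
        # Always decrement, even if the request handler raised.
        pool_in_use.labels(pool=pool_name).dec()


with tracked_connection("primary"):
    in_use_during = registry.get_sample_value(
        "db_pool_connections_in_use", {"pool": "primary"}
    )
in_use_after = registry.get_sample_value(
    "db_pool_connections_in_use", {"pool": "primary"}
)
```

With both gauges exported, an alert on a ratio such as db_pool_connections_in_use / db_pool_size staying near 1 would likely flag the saturation long before anyone starts reading application code.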
Why Observability is Not Just Logging
Most engineers learn to "add logging" and consider observability done. That mental model breaks in production. Here is why the three pillars are each irreplaceable:
┌─────────────────────────────────────────────────────────────────┐
│                       OBSERVABILITY STACK                       │
├─────────────────┬──────────────────┬────────────────────────────┤
│ LOGS            │ METRICS          │ TRACES                     │
│                 │                  │                            │
│ "What happened  │ "How much /      │ "Where did the time go?"   │
│  and when?"     │  how often?"     │                            │
│                 │                  │                            │
│ Discrete        │ Aggregated       │ Causal chain across        │
│ events with     │ numerical        │ services, showing          │
│ context         │ measurements     │ parent-child timing        │
│                 │ over time        │                            │
│ structlog       │ Prometheus       │ OpenTelemetry + Jaeger     │
│ Loki            │ Grafana          │                            │
│ Datadog Logs    │ Alertmanager     │                            │
├─────────────────┴──────────────────┴────────────────────────────┤
│                         ERROR TRACKING                          │
│           "Which exceptions are happening, how often,           │
│            and what is their full context?"                     │
│                       Sentry / GlitchTip                        │
├─────────────────────────────────────────────────────────────────┤
│                          HEALTH CHECKS                          │
│      "Is this service safe to receive traffic right now?"       │
│                /liveness   /readiness   /startup                │
└─────────────────────────────────────────────────────────────────┘
Logs: What Happened
A log is a discrete event record. It has a timestamp, a severity level, a message, and ideally a rich set of structured key-value context fields. Logs answer questions like:
- "Which user triggered this error?"
- "What SQL query ran before the exception?"
- "What was the document_id being processed when the worker died?"
Logs are high cardinality - you can store as much context per event as you need. They are bad at answering "how often does this happen?" because that requires reading and counting many log lines.
Metrics: How Much / How Often
A metric is a numerical measurement aggregated over time. It answers:
- "How many requests per second are we serving right now?"
- "What is the p99 latency of the /api/classify endpoint?"
- "How many database connections are currently in use?"
Metrics are low cardinality - you cannot store per-user data in a Prometheus label without blowing up cardinality. They are great for alerting because they are pre-aggregated and cheap to query.
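The cardinality limit is easiest to see as arithmetic: in the worst case, a metric has one time series per combination of label values. A rough sketch, where the label names and counts are made-up assumptions rather than figures from a real system:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Bounded labels: a few hundred series, cheap to store and query.
bounded = series_count({"method": 5, "endpoint": 20, "status_class": 5})

# The same metric with a per-user label multiplies every existing series
# by the number of users.
with_user_id = series_count(
    {"method": 5, "endpoint": 20, "status_class": 5, "user_id": 100_000}
)
```

Under these assumed numbers the bounded metric tops out at 500 series, while adding `user_id` raises the ceiling to 50 million. That is why per-user detail belongs in logs or span attributes instead.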
Traces: Where Did the Time Go
A trace is a causal chain of timed operations across a distributed system. A single user request might touch an API gateway, two microservices, a database, Redis, and an external LLM API. A trace shows you:
- The exact wall-clock time each service spent on the request
- Which service was the bottleneck
- The gaps between services (network, queues, serialisation)
- Whether a slow downstream dependency caused a cascade
Traces answer the question that logs and metrics cannot: "The request took 800ms total - where did that time go?"
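To make "parent-child timing" concrete, here is a toy tracer that records nested spans and renders them as a waterfall. It is a teaching sketch built on the standard library, not the OpenTelemetry API; the `ToyTracer` class, the span names, and the sleep durations are all hypothetical, and the tracer is neither thread-safe nor async-safe.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    parent: "Span | None" = None
    duration_ms: float = 0.0
    children: list = field(default_factory=list)


class ToyTracer:
    """Records nested, timed spans for a single request."""

    def __init__(self):
        self.root = None
        self._current = None

    @contextmanager
    def span(self, name):
        span = Span(name=name, parent=self._current)
        if self._current is None:
            self.root = span
        else:
            self._current.children.append(span)
        self._current = span
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self._current = span.parent  # pop back to the enclosing span


def render(span, depth=0):
    """Return waterfall-style lines: one per span, indented by nesting depth."""
    lines = [f"{'  ' * depth}{span.name}: {span.duration_ms:.0f}ms"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = ToyTracer()
with tracer.span("POST /api/documents"):
    with tracer.span("auth.check"):
        time.sleep(0.01)
    with tracer.span("db.query"):
        time.sleep(0.05)  # the simulated bottleneck

waterfall = render(tracer.root)
slowest = max(tracer.root.children, key=lambda s: s.duration_ms)
```

Printing `"\n".join(waterfall)` gives an indented timing tree in which `db.query` dominates the parent span, which is the same reading you would take from a Jaeger waterfall.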
The Mistake: Thinking One Pillar Is Enough
| Scenario | Logs alone | Metrics alone | Traces alone |
|---|---|---|---|
| High p99 latency | See individual slow requests but no pattern | See the spike but not why or which service | See the bottleneck if you have a trace |
| Exception spike | See exceptions with context | See the rate spike but no context | Traces show span errors but not exception details |
| Connection pool exhaustion | See timeout errors but not pool state | See the gauge and alert immediately | Not directly visible without custom spans |
| Which user was affected | Yes, if logged | No - metrics are aggregated | Yes, if user ID in span attributes |
You need all three. They are complementary, not redundant.
The print() Problem
Every Python developer starts with print(). Here is what is wrong with it in production:
# What most beginners write
print(f"Processing document {doc_id}")
print(f"Error: {e}")

# What production requires
import structlog

log = structlog.get_logger()
log.info(
    "document.processing.started",
    document_id=doc_id,
    user_id=current_user.id,
    file_size_bytes=doc.size,
    content_type=doc.content_type,
)
The difference is not cosmetic. With print():
- There is no timestamp (or it is not machine-parseable)
- There is no severity level - you cannot filter for errors only
- There is no structured data - you cannot query document_id = "abc123" in Kibana
- It writes to stdout synchronously with no buffering control - under load, a slow stdout consumer can stall your event loop
- You cannot route it to different destinations (file, syslog, log aggregator)
- You cannot suppress it in tests without redirecting stdout
With structured logging, a log line becomes a queryable document:
{
  "timestamp": "2026-03-07T14:23:01.234Z",
  "level": "info",
  "event": "document.processing.started",
  "document_id": "doc_8f3a2c",
  "user_id": "usr_99f1b4",
  "file_size_bytes": 204800,
  "content_type": "application/pdf",
  "service": "document-api",
  "version": "2.14.0",
  "environment": "production",
  "request_id": "req_7e9d3b"
}
That single line can be searched, aggregated, alerted on, and correlated with traces - automatically.
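The request_id field is what makes correlation possible, and it should not have to be passed by hand to every log call. The sketch below shows the mechanism with only the standard library: a contextvars variable holds the current request ID and a JSON formatter reads it on every record. The `JsonFormatter` class and the `fields` convention are assumptions made for this example; structlog provides its own contextvars helpers for the same purpose.

```python
import contextvars
import io
import json
import logging

# Holds the current request's ID; each asyncio task or thread sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including the request ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Fields passed via extra={"fields": {...}} become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("document-api")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# In a web service this would be set once per request, e.g. in middleware.
request_id_var.set("req_7e9d3b")
log.info(
    "document.processing.started",
    extra={"fields": {"document_id": "doc_8f3a2c"}},
)

record = json.loads(stream.getvalue())
```

Every log call made while that context is active carries the same request_id, so one search on the ID returns the full story of a request.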
A Metric Is Not a Log
A common mistake is trying to use logs as metrics:
# Wrong: trying to use a log query as a metric
log.info("cache_miss", key=cache_key)
# Then querying: count(event="cache_miss") per minute in Kibana
This works at small scale. At production scale:
- Log ingestion has latency - your "metric" lags 30–60 seconds
- Log storage is expensive - you are paying per GB for numerical data
- Log queries are slow - COUNT queries on log indices are full scans
- Log cardinality is unbounded - a log-derived metric grouped by a UUID field creates one series per distinct value
The right solution: a Prometheus counter.
from prometheus_client import Counter

cache_misses = Counter(
    "cache_misses_total",
    "Total cache misses",
    ["cache_name", "operation"],
)

# In your cache layer:
cache_misses.labels(cache_name="document_cache", operation="get").inc()
Now rate(cache_misses_total[5m]) in PromQL gives you the per-second cache miss rate averaged over the last five minutes, with no log parsing, latency bounded by the scrape interval, and negligible storage.
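To make the rate() calculation less of a black box, here is a simplified Python version of the idea: sum the increases between consecutive scrapes and divide by the elapsed time. It is a sketch, not Prometheus' actual algorithm. It handles counter resets but skips the extrapolation to window boundaries that rate() also performs, and the sample values are invented.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return 0.0  # simplification: not enough data to compute a rate
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0,
        # so the whole current value counts as new increase.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of cache_misses_total, 15 seconds apart,
# with a process restart before the fourth scrape.
scrapes = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 10.0), (60.0, 40.0)]
misses_per_second = simple_rate(scrapes)
```

The reset handling is the reason counters only ever go up in application code: the query layer can then recover a correct rate across restarts.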
A Trace Is Not a Metric
Another common mistake:
# Wrong: using a histogram to find which service is slow
request_latency.labels(service="downstream-api").observe(latency)
# This tells you the downstream API is slow
# But it does NOT show you why - is it the network? The DB? A specific query?
A Prometheus histogram tells you that the downstream API is slow. A distributed trace tells you why - it shows you every operation inside that service with its individual timing, the exact SQL queries that ran, the Redis lookups that happened, and the outbound HTTP calls that were made.
Use metrics for alerting. Use traces for root cause analysis.
The Observability Stack Used in This Module
All tools in this module are open source and run locally with docker-compose:
| Tool | Role | Port |
|---|---|---|
| structlog | Structured logging library | (library) |
| Loki | Log aggregation and storage | 3100 |
| Promtail | Log shipper (files → Loki) | 9080 |
| Prometheus | Metrics scraping and storage | 9090 |
| Alertmanager | Alert routing and deduplication | 9093 |
| Grafana | Metrics and log dashboards | 3000 |
| OpenTelemetry Collector | Trace collection and routing | 4317/4318 |
| Jaeger | Distributed trace storage and UI | 16686 |
| Sentry (self-hosted) | Error tracking | 9000 |
Full docker-compose setup provided in Lesson 01.
Module Lessons
Lesson 01 - Structured Logging
The Python logging module internals, structlog pipeline configuration, correlation IDs via contextvars, JSON formatting, sensitive data masking, log aggregation with Loki, and async non-blocking log handlers. Transforms an unstructured service into one whose logs are instantly searchable.
Key deliverable: A logging_config.py module that any FastAPI service can drop in and immediately produce structured, correlated, JSON logs shipped to Loki.
Lesson 02 - Metrics with Prometheus
The Prometheus data model, all four metric types with real use cases, FastAPI auto-instrumentation, custom application metrics, PromQL for SRE work, Alertmanager rules, and a complete Grafana dashboard JSON.
Key deliverable: A metrics.py module with application-level metrics for a document processing service, 10 real PromQL queries, and 5 production alerting rules.
Lesson 03 - Distributed Tracing
OpenTelemetry Python SDK, auto-instrumentation for FastAPI / SQLAlchemy / Redis / HTTPX, custom spans for business logic, W3C trace context propagation, baggage, sampling strategies, and reading Jaeger waterfall diagrams.
Key deliverable: Full OpenTelemetry setup for a multi-service Python application with context propagation through HTTP, and trace IDs injected into log lines.
Lesson 04 - Error Tracking
Sentry Python SDK, enriching errors with user context and breadcrumbs, custom fingerprinting for error grouping, before_send hooks for sensitive data filtering, release tracking with source maps, and building an error triage workflow.
Key deliverable: A production Sentry configuration that groups errors intelligently, masks PII, and integrates with your release pipeline.
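As a preview of the before_send idea, the hook is just a function that receives the event dictionary and returns a scrubbed version (or None to drop the event). The sketch below is an assumption-laden illustration: the header list, the `mask_pii` name, and the sample event are hypothetical, and a production hook would cover more fields.

```python
import copy

# Illustrative list; extend it to match the headers your service receives.
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}


def mask_pii(event: dict, hint: dict) -> dict:
    """before_send-style hook: return a scrubbed copy of the event."""
    event = copy.deepcopy(event)  # leave the caller's dictionary untouched

    headers = event.get("request", {}).get("headers", {})
    for name in headers:
        if name.lower() in SENSITIVE_HEADERS:
            headers[name] = "[Filtered]"

    user = event.get("user", {})
    if "email" in user:
        user["email"] = "[Filtered]"

    return event


# Wiring it up would look roughly like this (not executed here):
#   sentry_sdk.init(dsn="...", before_send=mask_pii)

sample_event = {
    "request": {
        "headers": {"Authorization": "Bearer secret", "Accept": "application/json"}
    },
    "user": {"id": "usr_99f1b4", "email": "a@example.com"},
}
scrubbed = mask_pii(sample_event, hint={})
```

Keeping the hook a pure function of its input makes it easy to unit test without a Sentry project or network access.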
Lesson 05 - Health Checks and Readiness
Kubernetes liveness vs readiness vs startup probes, designing health checks that accurately reflect service health, parallel dependency checks with timeouts, SLOs and error budgets, synthetic monitoring, and health check anti-patterns.
Key deliverable: A complete /liveness, /readiness, and /startup implementation for a FastAPI service with PostgreSQL, Redis, and external API dependencies.
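The "parallel dependency checks with timeouts" idea can be sketched with plain asyncio before any framework is involved. In this illustration the three check coroutines are stand-ins for real probes (for example a SELECT 1 or a Redis PING), and the function names and timeout value are assumptions.

```python
import asyncio


async def check_with_timeout(name, check, timeout_s):
    """Run one dependency check; a timeout or exception counts as unhealthy."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, "timeout"
    except Exception as exc:
        return name, f"error: {exc}"


async def readiness(checks, timeout_s=0.5):
    """Run all checks concurrently; ready only if every dependency is ok."""
    pairs = await asyncio.gather(
        *(check_with_timeout(name, check, timeout_s) for name, check in checks.items())
    )
    results = dict(pairs)
    ready = all(status == "ok" for status in results.values())
    return {"ready": ready, "checks": results}


# Stand-ins for real dependency probes.
async def postgres_ok():
    await asyncio.sleep(0.01)


async def redis_hangs():
    await asyncio.sleep(10)  # simulates a dependency that never answers


async def api_fails():
    raise ConnectionError("refused")


report = asyncio.run(
    readiness(
        {"postgres": postgres_ok, "redis": redis_hangs, "external_api": api_fails},
        timeout_s=0.1,
    )
)
```

Because the checks run concurrently, the whole probe takes about as long as the timeout instead of the sum of all checks. A readiness endpoint built on this would typically return HTTP 503 when `ready` is false so that Kubernetes stops routing traffic to the pod.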
Observability Maturity Model
Before starting, assess where your service sits today:
| Level | Name | Characteristics |
|---|---|---|
| 0 | Dark | print() statements, no structure, errors discovered by users |
| 1 | Basic Logs | logging.basicConfig(), some log lines, unstructured text |
| 2 | Structured Logs | JSON logs with levels, timestamps, and some context fields |
| 3 | Correlated Logs | Request IDs in every log line, logs shipped to aggregator |
| 4 | Metrics | Prometheus counters/histograms, dashboards, basic alerts |
| 5 | Error Tracking | Sentry with user context, release tracking, error workflows |
| 6 | Tracing | Distributed traces, p99 from traces, traces linked to logs |
| 7 | Full Observability | SLOs, error budgets, synthetic monitoring, runbooks linked to alerts |
Most production Python services in the wild sit at Level 1 or 2. This module takes you to Level 7.
How to Work Through This Module
Each lesson follows the same structure:
- Opening incident - a real production failure caused by missing observability
- Concepts - the theory, explained through the lens of what the incident needed
- Working code - production-grade implementations, not toy examples
- Integration - how this pillar connects to the others
- Interview Q&A - five questions asked at senior/staff engineering interviews
Run each lesson's code examples locally. By the end of Lesson 03, you will have a fully instrumented Python service with logs, metrics, and traces all running in docker-compose, all visible in Grafana.
Quick Reference: The Golden Signals
Before diving into implementation, here are the four signals every production service must measure (from Google's SRE Book):
| Signal | What It Measures | Prometheus Metric Type |
|---|---|---|
| Latency | Time to serve a request (success vs error latency separately) | Histogram |
| Traffic | How much demand is hitting the system | Counter |
| Errors | Rate of failed requests (5xx, explicit failures, wrong results) | Counter |
| Saturation | How "full" the service is (CPU, memory, connection pools, queue depth) | Gauge |
These four signals, exposed correctly, will catch most production incidents before users notice them. Lessons 02 through 05 show you how to implement each one properly.
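As a rough preview, the table maps onto prometheus_client like this. It is a minimal sketch: the metric names, label names, and the `record_request` helper are hypothetical, and an isolated registry is used only so the example does not touch the default one.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()  # isolated registry for the sketch

# Latency: a histogram, labelled by outcome so success and error
# latency can be read separately.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Latency: time to serve a request",
    ["endpoint", "outcome"],
    registry=registry,
)
# Traffic and errors: monotonically increasing counters.
REQUESTS = Counter(
    "http_requests_total", "Traffic: requests received", ["endpoint"], registry=registry
)
ERRORS = Counter(
    "http_request_errors_total", "Errors: failed requests", ["endpoint"], registry=registry
)
# Saturation: a gauge that can go up and down.
POOL_IN_USE = Gauge(
    "db_pool_connections_in_use",
    "Saturation: connections checked out",
    registry=registry,
)


def record_request(endpoint: str, duration_s: float, failed: bool) -> None:
    """Update traffic, error, and latency signals for one finished request."""
    REQUESTS.labels(endpoint=endpoint).inc()
    if failed:
        ERRORS.labels(endpoint=endpoint).inc()
    outcome = "error" if failed else "success"
    REQUEST_LATENCY.labels(endpoint=endpoint, outcome=outcome).observe(duration_s)


record_request("/api/documents", 0.4, failed=False)
record_request("/api/documents", 11.4, failed=True)
POOL_IN_USE.set(10)

total_requests = registry.get_sample_value(
    "http_requests_total", {"endpoint": "/api/documents"}
)
```

Note that the labels here are bounded (endpoint and outcome), which keeps the series count small.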
Let's build observable systems.
